NLP

Linguistic Knowledge and Transferability of Contextual Representations

This paper takes an in-depth look at the linguistic knowledge learned by pretrained word representations and at their transferability, analyzing ELMo, GPT, and BERT through extensive comparative experiments and drawing several meaningful conclusions. NAACL 2019.

paper link

Introduction

Pretrained word representations (ELMo, GPT, BERT) have been successful on many NLP tasks, but current research lacks an in-depth analysis of their:

  • linguistic knowledge
  • transferability

For linguistic knowledge, the paper uses 17 different probing tasks to analyze what pretrained representations learn, such as coreference, knowledge of semantic relations, and entity information. For transferability: in practice, pretraining as a language model achieves the best results, but other pretraining tasks could be used as well; the authors therefore pretrain on 12 tasks and transfer to 9 target tasks to analyze how the choice of pretraining task affects the learned linguistic knowledge.

The core questions the paper addresses are:

  1. What features of language do these vectors capture, and what do they miss?
  2. How and why does transferability vary across representation layers in contextualizers?
  3. How does the choice of pretraining task affect the vectors’ learned linguistic knowledge and transferability?

From extensive experimental analysis, the authors draw the following conclusions:

  • Adding a linear output layer on top of frozen, trained CWRs (contextual word representations) matches task-specific SOTA models in most cases, but fails on tasks that require fine-grained linguistic knowledge; there, contextual features trained on those tasks greatly improve the encoding of the needed information.

  • The first layer of an LSTM transfers best, while for Transformers the middle layers transfer best.

  • Higher layers of multi-layer LSTMs are more task-specific and less general; Transformers do not show the same monotonic trend in task-specificity.

  • Overall, language model pretraining transfers better than the other 7 pretraining tasks examined, but for a given target task, pretraining on a related task yields the best results.

Probing Tasks

The authors use probing tasks to test the linguistic knowledge contained in CWRs: a separate model is trained on top of the (frozen) CWRs to predict a particular label, which shows whether a given type of information is encoded, as illustrated in Figure 1.

Figure 1: An illustration of the probing model setup used to study the linguistic knowledge within contextual word representations.
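
As a minimal PyTorch sketch of this setup (shapes and names are illustrative, not the paper's code): the contextualizer is frozen, and only a linear layer is trained to predict the probing label.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear probing model: a single affine layer trained on top of
    frozen contextual word representations (CWRs)."""
    def __init__(self, cwr_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(cwr_dim, num_labels)

    def forward(self, cwrs: torch.Tensor) -> torch.Tensor:
        # cwrs: (batch, seq_len, cwr_dim), produced by a frozen
        # contextualizer (detached, so no gradients reach it)
        return self.classifier(cwrs)  # (batch, seq_len, num_labels)

# Only the probe's parameters are trained; probe accuracy therefore
# reflects what the CWRs already encode.
probe = LinearProbe(cwr_dim=1024, num_labels=45)  # 45: illustrative tagset size
optimizer = torch.optim.Adam(probe.parameters())
```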

The paper runs 17 different probing tasks in total, grouped into the following categories:

  • Token Labeling: POS, CCG, ST, preposition supersense disambiguation, event factuality (EF)
  • Segmentation: syntactic chunking (Chunk), named entity recognition (NER), grammatical error detection (GED), conjunct identification (Conj)
  • Pairwise Relations (predicting the relation between a pair of tokens; see the sketch after this list):
    • Arc prediction is a binary classification task, where the model is trained to identify whether a relation exists between two tokens.
    • Arc classification is a multiclass classification task, where the model is provided with two tokens that are linked via some relationship and trained to identify how they are related.
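
For the pairwise tasks, the probe scores a pair of token representations instead of a single one. A minimal sketch, assuming one common featurization (concatenating the two CWRs with their elementwise product; the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn

class PairwiseProbe(nn.Module):
    """Probe for pairwise relations between two tokens.
    num_labels=2 gives arc prediction (does a relation exist?);
    num_labels=len(relation_set) gives arc classification."""
    def __init__(self, cwr_dim: int, num_labels: int):
        super().__init__()
        # Featurization assumed here: both CWRs plus their elementwise
        # product, followed by a linear classifier.
        self.classifier = nn.Linear(3 * cwr_dim, num_labels)

    def forward(self, h1: torch.Tensor, h2: torch.Tensor) -> torch.Tensor:
        # h1, h2: (batch, cwr_dim) -- the two tokens' representations
        features = torch.cat([h1, h2, h1 * h2], dim=-1)
        return self.classifier(features)
```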

Models

Probing Model
The authors use a linear model as the probing model, deliberately limiting the probe's own capacity so that performance reflects the CWRs themselves.

Contextualizers

  • ELMo (original): uses a 2-layer LSTM
  • ELMo (4-layer): uses a 4-layer LSTM
  • ELMo (transformer): uses a 6-layer transformer
  • OpenAI transformer
  • BERT (base, cased): uses a 12-layer transformer
  • BERT (large, cased): uses a 24-layer transformer

Pretrained Contextualizer Comparison

Table 1 shows the following:

  • CWRs outperform non-contextual word vectors (GloVe) on all tasks.
  • Comparing the ELMo-based contextualizers, we see that ELMo (4-layer) and ELMo (original) are essentially even, though both recurrent models outperform ELMo (transformer). OpenAI transformer significantly underperforms the ELMo models and BERT. BERT significantly improves over the ELMo and OpenAI models.
  • On some tasks (e.g., NER), CWRs fall short of SOTA models; the authors suggest two possible reasons:
    • the CWR simply does not encode the pertinent information or any predictive correlates (the authors suggest this can be remedied by learning task-specific contextual features on top of the CWRs);
    • the probing model does not have the capacity necessary to extract the information or predictive correlates from the vector (this can be addressed by increasing the probing model's capacity).

To better understand why the probing model underperforms, the authors run two further experiments:

  1. a contextual probing model that uses a task-trained LSTM (unidirectional, 200 hidden units) before the linear output layer (thus adding task-specific contextualization)
  2. replacing the linear probing model with a multilayer perceptron (MLP; adding more parameters to the probing model: a single 1024d hidden layer activated by ReLU).

These two experiments correspond to the two hypotheses above, and the added parameter counts are roughly matched. The authors also combine the two models as an upper bound: the CWRs are fed into a 2-layer BiLSTM with 512 hidden units, whose output goes through an MLP with a single 1024-dimensional ReLU hidden layer to predict a label. Using the ELMo (original) pretrained contextualizer, the results are shown in Table 2: in all cases, adding more parameters (either by replacing the linear model with an MLP or by using a contextual probing model) leads to significant gains over the linear probing model.
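
The two augmented probes could look roughly as follows, using the dimensions quoted above (a sketch, not the paper's code):

```python
import torch.nn as nn

class MLPProbe(nn.Module):
    """Variant 2: linear probe replaced by an MLP with a single
    1024-d hidden layer activated by ReLU."""
    def __init__(self, cwr_dim: int, num_labels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cwr_dim, 1024), nn.ReLU(), nn.Linear(1024, num_labels)
        )

    def forward(self, cwrs):
        return self.net(cwrs)

class ContextualProbe(nn.Module):
    """Variant 1: a task-trained unidirectional LSTM (200 hidden units)
    inserted before the linear output layer, adding task-specific
    contextualization on top of the frozen CWRs."""
    def __init__(self, cwr_dim: int, num_labels: int):
        super().__init__()
        self.lstm = nn.LSTM(cwr_dim, 200, batch_first=True)
        self.classifier = nn.Linear(200, num_labels)

    def forward(self, cwrs):
        # cwrs: (batch, seq_len, cwr_dim) from the frozen contextualizer
        hidden, _ = self.lstm(cwrs)
        return self.classifier(hidden)
```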

These experiments show that different target tasks call for different output layers when building on CWRs; when the pretraining task fails to capture certain information needed by the target task, task-specific contextualization (like the extra LSTM layer in Table 2) becomes especially important.

However, such end-task specific contextualization can come from either fine-tuning CWRs or using fixed output features as inputs to a task-trained contextualizer; Peters et al. (2019) begin to explore when each approach should be applied.

Analyzing Layerwise Transferability

These experiments examine which layer's CWRs transfer best: the authors attach a probing model to each layer of each contextualizer and compare performance.

Figure 3: A visualization of layerwise patterns in task performance. Each column represents a probing task, and each row represents a contextualizer layer.

  1. The first layer of contextualization in recurrent models (original and 4-layer ELMo) is consistently the most transferable, even outperforming a scalar mix of layers on most tasks.
  2. Transformer-based contextualizers have no single most-transferable layer; the best-performing layer for each task varies, and is usually near the middle. Accordingly, a scalar mix of transformer layers outperforms the best individual layer on most tasks (a minimal scalar-mix sketch follows below).
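
The scalar mix referenced above is the ELMo-style learned weighted combination of layers: softmax-normalized per-layer scalars plus a global scale. A minimal sketch:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style scalar mix: mix = gamma * sum_l softmax(s)_l * h_l,
    with one learned scalar s_l per layer and a global scale gamma."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_l
        self.gamma = nn.Parameter(torch.tensor(1.0))          # global scale

    def forward(self, layers):
        # layers: list of per-layer tensors, each (batch, seq_len, dim)
        weights = torch.softmax(self.scalars, dim=0)
        return self.gamma * sum(w * h for w, h in zip(weights, layers))
```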

Transferring Between Tasks

These experiments examine how the choice of pretraining task affects transfer.

Table 3: Performance (averaged across target tasks) of contextualizers pretrained on a variety of tasks.

Overall, bidirectional language model pretraining transfers best, but pretraining on a task related to the target yields even better results on that target. Larger pretraining corpora also help.

Conclusion

  • Adding a linear output layer on top of frozen, trained CWRs (contextual word representations) matches task-specific SOTA models in most cases, but fails on tasks that require fine-grained linguistic knowledge; there, contextual features trained on those tasks greatly improve the encoding of the needed information.

  • The first layer of an LSTM transfers best, while for Transformers the middle layers transfer best.

  • Higher layers of multi-layer LSTMs are more task-specific and less general; Transformers do not show the same monotonic trend in task-specificity.

  • Overall, language model pretraining transfers better than the other 7 pretraining tasks examined, but for a given target task, pretraining on a related task yields the best results.